## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
There are 4898 white wines in our dataset with 12 features for EDA.
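As a minimal sketch of how the preview above relates to the 12 features, the six printed rows can be rebuilt as a small data frame. The name `ww_head` is an assumption made for illustration; the real analysis works on the full file of 4898 rows. The `X` column is only a row index, which is why 13 columns correspond to 12 features.

```r
# Rebuild the six preview rows shown above (values copied from the printed output).
# `ww_head` is a hypothetical name; the full dataset has 4898 rows.
ww_head <- data.frame(
  X = 1:6,
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.23, 0.28),
  citric.acid = c(0.36, 0.34, 0.40, 0.32, 0.32, 0.40),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  chlorides = c(0.045, 0.049, 0.050, 0.058, 0.058, 0.050),
  free.sulfur.dioxide = c(45, 14, 30, 47, 47, 30),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97),
  density = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951),
  pH = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26),
  sulphates = c(0.45, 0.49, 0.44, 0.40, 0.40, 0.44),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  quality = c(6, 6, 6, 6, 6, 6)
)

# Drop the row index so that only the 12 wine features remain.
features <- ww_head[, names(ww_head) != "X"]

dim(ww_head)    # rows and columns including the index
ncol(features)  # number of features used in the analysis
str(features)
```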
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
The quality looks roughly normally distributed, so no transformation is needed.
The majority of white wines have a quality rating between 5 and 7; there are very few wines at the outer boundaries of the histogram.
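The share of wines in the 5 to 7 range can be checked with a short sketch. The counts per quality level are copied from the `n` column of the per-quality summary tables shown later in this report; the variable names are assumptions for illustration.

```r
# Wines per quality level, copied from the `n` column of the per-quality summary tables.
counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198, `7` = 880, `8` = 175, `9` = 5)

# Expand the counts into one rating per wine.
quality <- rep(as.integer(names(counts)), times = counts)

length(quality)   # total number of wines
summary(quality)  # five-number summary plus mean

# Share of wines rated 5, 6 or 7.
share_5_to_7 <- mean(quality >= 5 & quality <= 7)
round(share_5_to_7, 3)
```

With these counts, 4535 of the 4898 wines (about 92.6%) are rated between 5 and 7.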
The fixed acidity looks roughly normally distributed, perhaps slightly skewed to the right. I observed some outliers below 4 or beyond 11, which are not included in the histogram. I reduced the binwidth for a better view. For now, we keep the skewness in mind, as a log10 transformation might be worth testing later on.
The histogram for volatile acidity appears even more right-skewed than the one for fixed acidity. I set the histogram limits from 0.1 to 0.7, knowing that there are wines with more volatile acidity in our population. The second chart shows the log10-transformed version, which comes much closer to a normal distribution; we might be able to use that later on.
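The effect of a log10 transformation on a right-skewed variable can be illustrated with a small sketch. It uses simulated log-normal values as a stand-in, not the wine data, and the `skewness` helper is defined here for illustration only.

```r
set.seed(42)

# Simulated stand-in for a right-skewed variable (NOT the wine data).
x <- rlnorm(4898, meanlog = log(0.26), sdlog = 0.35)

# Sample skewness: third standardized moment.
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))

# Histogram counts restricted to 0.1..0.7, mirroring the x limits used for the plot.
in_view <- x[x >= 0.1 & x <= 0.7]
h <- hist(in_view, breaks = seq(0.1, 0.7, length.out = 31), plot = FALSE)

c(raw = skew_raw, log10 = skew_log)
sum(h$counts) == length(in_view)
```

A skewness near zero after the transformation is what the second chart suggests visually; for the real data the value would have to be computed on the actual column.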
Citric acid looks rather normally distributed, even though we have surprisingly many wines at 0.48/0.49 and 0.73/0.74.
Residual sugar is clearly skewed to the right, with many wines at a peak below 2.5. The boxplot over the scattered points shows the high number of outliers in the fourth quartile. The log10 version doesn’t look like a bell curve, as there is a drop in the middle; let’s see how we can deal with this later on.
Chlorides are a little right-skewed; the log10 transformation brings them close to a normal distribution.
Free sulfur dioxide looks rather normally distributed. Again, I set the xlims.
Total sulfur dioxide also looks rather normally distributed. Limits for x were modified.
Density is interesting: the distribution is roughly normal, but high counts seem to alternate with low counts as we move along the x axis. This may be due to measurements taken with different lab equipment.
The pH value looks plainly normally distributed. No x limits this time.
Sulphates are again right-skewed; the log10 transformation yields a roughly normal distribution. No x limits this time.
Alcohol is to some extent a special case. It’s not really normally distributed, and transformations with log10 or sqrt don’t change that; more involved transformations might be needed at a later stage.
There are 4898 white wines in our dataset with 12 numerical features:
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Key statistics for these variables are listed below:
1 - fixed acidity (range: 3.80 - 14.20, mean: 6.86)
2 - volatile acidity (range: 0.08 - 1.10, mean: 0.28)
3 - citric acid (range: 0.00 - 1.66, mean: 0.33)
4 - residual sugar (range: 0.60 - 65.80, mean: 6.39)
5 - chlorides (range: 0.01 - 0.35, mean: 0.05)
6 - free sulfur dioxide (range: 2.00 - 289.00, mean: 35.31)
7 - total sulfur dioxide (range: 9.00 - 440.00, mean: 138.40)
8 - density (range: 0.99 - 1.04, mean: 0.99)
9 - pH (range: 2.72 - 3.82, mean: 3.19)
10 - sulphates (range: 0.22 - 1.08, mean: 0.49)
11 - alcohol (range: 8.00 - 14.20, mean: 10.51)
12 - quality (range: 3.00 - 9.00, mean: 5.88)
Details for quality, our dependent variable and the main feature of interest in this analysis:
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000 5.000 6.000 5.878 6.000 9.000
To explore what contributes to the quality derived from sensory tests, I will take a closer look at all other variables. I may learn something new along the way, as I’m no white wine expert beyond knowing what alcohol and pH mean.
Other observations:
The majority of white wines have a quality rating between 5 and 7; there are very few wines at the outer boundaries.
Binwidth modifications and x-limit reductions were usually helpful for the plots, as many variables are more or less skewed and strain the plot limits.
I made some log10 transformations to get a better plot with regards to the normal distribution, specifically for volatile acidity, residual sugar, chlorides and sulphates. This information will be useful for later analysis as we should try to work with normal distributions to avoid misleading results.
Residual sugar is clearly skewed to the right, with many wines at a peak below 2.5. The log10 version doesn’t look like a bell curve, as there is a drop in the middle; let’s see how we can deal with this later on.
Density is interesting: the distribution is roughly normal, but high counts seem to alternate with low counts as we move along the x axis. This may be due to the way it was measured.
Alcohol is to some extent a special case. It’s not really normally distributed, and transformations with log10 or sqrt don’t change that; more involved transformations might be needed at a later stage, or we could work with ratios.
https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
White Wines dataset
That’s helpful. We’re focussing on quality, so here are some observations:
Amongst the others, following observations are interesting:
We need to keep in mind the rule of thumb that a small degree of relatedness starts at r >= 0.3, which we can show here for some cases. Yet this also depends on the number of cases; we have plenty of cases in our dataset and will test significance in regression models.
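How strongly significance depends on the number of cases can be sketched with the standard t-test for a Pearson correlation. The coefficients below are illustrative values, not entries from the correlation matrix, and `cor_p_value` is a helper defined here for illustration.

```r
# Two-sided p-value for a Pearson correlation r observed in n cases,
# using t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom.
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

n <- 4898                      # number of wines in the dataset
r_values <- c(0.03, 0.1, 0.3)  # illustrative coefficients, not taken from the matrix

p_values <- sapply(r_values, cor_p_value, n = n)
data.frame(r = r_values, p = signif(p_values, 3))
```

With this many cases even a very weak correlation falls below the 5% level, so significance alone says little about the strength of a relation.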
With the univariate histograms and correlation plot we’re good to move on with more bivariate plots.
## item group1 vars n mean sd median trimmed mad
## X11 1 3 1 20 0.3332500 0.14082721 0.26 0.3165625 0.088956
## X12 2 4 1 163 0.3812270 0.17346335 0.32 0.3605725 0.111195
## X13 3 5 1 1457 0.3020110 0.10006628 0.28 0.2917138 0.074130
## X14 4 6 1 2198 0.2605641 0.08814208 0.25 0.2517670 0.074130
## X15 5 7 1 880 0.2627670 0.09110644 0.25 0.2554901 0.088956
## X16 6 8 1 175 0.2774000 0.10802942 0.26 0.2667730 0.103782
## X17 7 9 1 5 0.2980000 0.05761944 0.27 0.2980000 0.044478
## min max range skew kurtosis se
## X11 0.17 0.640 0.470 0.8810200 -0.6840048 0.031489921
## X12 0.11 1.100 0.990 1.3750398 2.1528222 0.013586698
## X13 0.10 0.905 0.805 1.4260309 3.7511289 0.002621549
## X14 0.08 0.965 0.885 1.5315969 5.0307374 0.001880050
## X15 0.08 0.760 0.680 0.8086661 0.9838638 0.003071198
## X16 0.12 0.660 0.540 0.9745983 0.8616899 0.008166257
## X17 0.24 0.360 0.120 0.2140342 -2.2094469 0.025768197
The boxplot underlines a slightly negative correlation; I factored quality to get the desired view. Looking at the means (‘x’), the highest appear at quality levels 3 & 4. Here, the third quartiles are spread out quite a bit. We can also see a rather large number of volatile acidity outliers at quality levels 5 & 6.
For chlorides, the correlation coefficient with quality was slightly negative. This can be confirmed by looking at the plot: medium chlorides are centered at medium quality (level 6), higher chlorides at quality level 5 and lower chlorides at quality level 7.
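The statement about the group means can be checked directly against the summary table above. The sketch below copies the `n` and `mean` columns for volatile acidity; the variable names are assumptions for illustration.

```r
# Per-quality counts and mean volatile acidity, copied from the summary table above.
n_by_quality <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198, `7` = 880, `8` = 175, `9` = 5)
mean_by_quality <- c(`3` = 0.3332500, `4` = 0.3812270, `5` = 0.3020110,
                     `6` = 0.2605641, `7` = 0.2627670, `8` = 0.2774000, `9` = 0.2980000)

# Quality levels with the two highest group means.
top_two <- names(sort(mean_by_quality, decreasing = TRUE))[1:2]

# Count-weighted average of the group means, which should match the overall mean.
overall_mean <- sum(n_by_quality * mean_by_quality) / sum(n_by_quality)

top_two
round(overall_mean, 4)
```

The weighted average reproduces the overall mean of 0.2782 from the summary output. Note that levels 3 and 9 contain only 20 and 5 wines, so their means are less stable than those of the middle levels.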
## item group1 vars n mean sd median trimmed mad min
## X11 1 3 1 20 170.6000 107.75833 159.5 159.5938 79.3191 19
## X12 2 4 1 163 125.2791 52.75377 117.0 124.4885 62.2692 10
## X13 3 5 1 1457 150.9046 44.08619 151.0 151.3569 45.9606 9
## X14 4 6 1 2198 137.0473 41.28622 132.0 135.3599 41.5128 18
## X15 5 7 1 880 125.1148 32.74298 122.0 123.3871 32.6172 34
## X16 6 8 1 175 126.1657 33.00633 122.0 124.2908 35.5824 59
## X17 7 9 1 5 116.0000 19.82423 119.0 116.0000 8.8956 85
## max range skew kurtosis se
## X11 440.0 421.0 0.81071555 0.095230530 24.0954953
## X12 272.0 262.0 0.20641772 -0.691207090 4.1319941
## X13 344.0 335.0 -0.03170024 0.073808470 1.1549755
## X14 294.0 276.0 0.35591447 -0.149547505 0.8806255
## X15 229.0 195.0 0.50036525 0.273050239 1.1037657
## X16 212.5 153.5 0.53693167 -0.002396868 2.4950440
## X17 139.0 54.0 -0.43928052 -1.436221665 8.8656641
The boxplot using factored quality and total sulfur dioxide shows a slightly negative relation. However, quality level 3 shows some outliers with high total sulfur dioxide levels. Removing these gives a rather level picture, so the relation between quality and total sulfur dioxide does not show a clear direction.
The scatterplot of quality and density confirms the negative relation. The line shows the mean density per quality level and moves towards lower density values as wine quality increases.
The scatterplot of quality and alcohol confirms the positive relation. The line shows the mean alcohol per quality level and moves towards higher alcohol values as wine quality increases. We need to be careful, as most data points sit at quality levels 5-7, so we should not conclude that extremely high alcohol levels will automatically lead to better wine quality ratings.
An interesting point is that density and alcohol have opposite effects on quality. Both are related to residual sugar, so we will have a look at that.
Here, our suspicion of a relation between some of our independent variables is confirmed. We can observe the following based on scatter plots with a linear regression smoother (purple):
The smoother lines confirm the correlation coefficients. This example also confirms that if we want to run a stellar regression analysis of what impacts white wine quality, we need to be aware that the independent variables are related with one another. Without me being the wine expert, it might be the case that the level of residual sugar in a wine influences both density and alcohol and therefore indirectly the quality, even though sugar does not show highly significant relations to wine quality in our correlation matrix.
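The idea that residual sugar and alcohol both feed into density can be illustrated with a small simulation. The data below are simulated stand-ins, not the wine data, and the coefficients used to generate density are assumptions chosen only to mimic the directions described above.

```r
set.seed(1)
n <- 2000

# Simulated stand-ins (NOT the wine data): sugar raises density, alcohol lowers it.
residual_sugar <- rexp(n, rate = 1 / 6)
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 0.994 + 0.0004 * residual_sugar - 0.001 * (alcohol - 10.5) +
  rnorm(n, sd = 0.0005)

sim <- data.frame(residual_sugar, alcohol, density)

# Pairwise correlations among the three predictors.
round(cor(sim), 2)

# Slope of the linear smoother for density on each of the other two variables.
coef(lm(density ~ residual_sugar, data = sim))["residual_sugar"]
coef(lm(density ~ alcohol, data = sim))["alcohol"]
```

The signs of the slopes match the signs of the correlation coefficients, which is what the smoother lines in the scatter plots express.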
One last scatterplot allows the conclusion that pH values decrease when fixed acidity increases. The mean line is rather smooth between pH 3.0 and 3.4 and shows some volatility outside that interval.
We could confirm most selected correlations with the plots as described below each plot. In one case, quality vs. total sulfur dioxide, the plot incl. outliers confirms the correlation, but would probably not do so when removing the outliers. This showcases that the correlation matrix should not be viewed as the single source of truth for further analysis.
The observations between density, alcohol and residual sugar indicate that some of the independent variables are interlinked. One could look at some initial R-squared values between quality and other variables here. But to avoid misleading conclusions, I will do that in the multivariate section rather than here in the bivariate analysis.
It will be interesting to enrich some of the plots with a third variable in the next section. Also, overall regression models will be developed to gain a high-level picture, though without running many model fit checks, e.g. against multicollinearity, as we could do with a VIF in Python.
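A variance inflation factor check of the kind mentioned here can also be sketched in base R. The data are simulated stand-ins, not the wine data, and `vif_by_hand` is a helper defined for illustration rather than a function from a package.

```r
set.seed(2)
n <- 2000

# Simulated stand-ins (NOT the wine data) with density driven by sugar and alcohol.
residual_sugar <- rexp(n, rate = 1 / 6)
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 0.994 + 0.0004 * residual_sugar - 0.001 * (alcohol - 10.5) +
  rnorm(n, sd = 0.0005)
predictors <- data.frame(residual_sugar, alcohol, density)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the others.
vif_by_hand <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vif_values <- vif_by_hand(predictors)
round(vif_values, 1)
```

A VIF of 1 means a predictor is unrelated to the others; values well above that indicate that a predictor is largely explained by the rest, which is the situation suspected for density.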
A closer look at quality vs. density supports our hypothesis. The lines for each factored quality level all indicate negative relations, which lets us conclude that the alcohol level declines with increasing density.
The two plots above visualize the negative relation between density and alcohol, basically no matter which quality we are looking at. The facet wrapper shows that the relation is valid for many quality levels, the highest level 9 only has a small number of data points.
The multicollinear relation between residual sugar, density and alcohol is indicated here. We can see the positive relation between residual sugar and density, yet it appears that the alcohol level decreases with increasing density.
An interesting observation is that the left side of the plot is rather light blue, meaning that wines with high residual sugar might also have high alcohol levels. We can clearly see here that density is more negatively related to alcohol than residual sugar is.
This representation shows that the relation between pH level and fixed acidity does not influence the quality level.
Our plots confirmed the main results from the bivariate section. Now let’s look in more depth into the data by building regression models. For these, I will first run a model with the plain variables “as they are” and then come back to the transformations that were introduced in the univariate chapter to use bell-curve like distributions.
##
## Calls:
## m1: lm(formula = I(quality) ~ density, data = ww)
## m2: lm(formula = I(quality) ~ density + alcohol, data = ww)
## m3: lm(formula = I(quality) ~ density + alcohol + residual.sugar,
## data = ww)
## m4: lm(formula = I(quality) ~ density + alcohol + residual.sugar +
## volatile.acidity, data = ww)
## m5: lm(formula = I(quality) ~ density + alcohol + residual.sugar +
## volatile.acidity + pH, data = ww)
## m6: lm(formula = I(quality) ~ density + alcohol + residual.sugar +
## volatile.acidity + pH + chlorides, data = ww)
##
## ========================================================================================================
## m1 m2 m3 m4 m5 m6
## --------------------------------------------------------------------------------------------------------
## (Intercept) 96.277*** -22.492*** 90.313*** 74.225*** 97.650*** 96.932***
## (4.003) (6.165) (12.374) (11.977) (12.392) (12.436)
## density -90.942*** 24.728*** -87.886*** -71.546*** -96.535*** -95.761***
## (4.027) (6.079) (12.317) (11.923) (12.404) (12.455)
## alcohol 0.360*** 0.246*** 0.286*** 0.253*** 0.251***
## (0.015) (0.018) (0.018) (0.018) (0.019)
## residual.sugar 0.053*** 0.052*** 0.064*** 0.064***
## (0.005) (0.005) (0.005) (0.005)
## volatile.acidity -2.059*** -2.024*** -2.016***
## (0.109) (0.109) (0.109)
## pH 0.528*** 0.524***
## (0.076) (0.077)
## chlorides -0.373
## (0.539)
## --------------------------------------------------------------------------------------------------------
## R-squared 0.094 0.192 0.210 0.264 0.271 0.271
## adj. R-squared 0.094 0.192 0.210 0.263 0.270 0.270
## sigma 0.843 0.796 0.787 0.760 0.757 0.757
## F 509.911 583.290 434.085 438.646 363.847 303.254
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -6111.983 -5831.127 -5776.812 -5604.126 -5580.287 -5580.047
## Deviance 3478.689 3101.773 3033.737 2827.187 2799.800 2799.526
## AIC 12229.967 11670.255 11563.624 11220.251 11174.574 11176.094
## BIC 12249.456 11696.241 11596.107 11259.231 11220.050 11228.067
## N 4898 4898 4898 4898 4898 4898
## ========================================================================================================
Based on our findings before, I have developed an approach with six models.
Model 1: Density has a highly significant negative influence on the quality rating. It can explain about 9.4% of the variance of quality level ratings. The denser the wine, the worse the quality is rated.
Model 2: Alcohol has a highly significant positive influence on the quality rating. Adding it explains about another 9.8% of the variance in quality level ratings (R-squared rises from 0.094 to 0.192). Its coefficient is smaller than the one of density; this can be explained by the different measurement scales and the very small range of density values, which lead to a very large change in predicted quality if density is increased by 1. Interestingly, density shows a positive relation to quality in model 2, opposite to all other models.
The more alcohol, the better the wine quality is rated.
Model 3: Residual sugar also has a highly significant positive influence on the quality ratings. The more residual sugar, the better the wine quality is rated. The comprehensive regression model rejects the initial hypothesis, based on bivariate correlations, that the effect of residual sugar on quality is negative; here the relation is positive.
Model 4: Volatile acidity has a highly significant negative influence on the quality ratings. The higher the volatile acidity, the worse the wine quality is rated. The comprehensive regression model confirms the initial hypothesis, based on bivariate correlations, that the effect of volatile acidity on quality is negative.
Model 5: pH value has a highly significant positive influence on the quality ratings. The higher the pH value, the better the quality is perceived.
Model 6: I tested chlorides on top of model 5, but it was neither significant nor did it improve the adjusted R-squared.
Adjusted R-squared: Models 1 and 2 contribute the largest share of our overall adjusted R-squared, followed by volatile acidity and residual sugar. The overall value of 0.270 means that our model is able to explain about 27% of the variance in quality ratings, which is quite good but also means that about 73% is not explained by the variables we built into our models.
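Several figures quoted in the model notes can be reproduced from the regression table with a few lines of arithmetic. The sketch below copies R-squared and AIC from the table, with variable names chosen for illustration. The range comparison at the end is only a rough way to put the density and alcohol coefficients on a comparable footing, not part of the original analysis.

```r
# Values copied from the regression table above (untransformed models m1..m6).
models <- paste0("m", 1:6)
r2  <- setNames(c(0.094, 0.192, 0.210, 0.264, 0.271, 0.271), models)
aic <- setNames(c(12229.967, 11670.255, 11563.624, 11220.251, 11174.574, 11176.094), models)
n <- 4898
k <- 1:6  # number of predictors in each model

# R-squared gained by each added predictor (m2 adds alcohol, and so on).
r2_gain <- diff(r2)

# Adjusted R-squared recomputed from R-squared, n and k.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Model with the lowest AIC.
best_by_aic <- names(which.min(aic))

# Quality change implied across each variable's observed range
# (m1 density coefficient, m2 alcohol coefficient; ranges from the summary output).
density_effect <- -90.942 * (1.0390 - 0.9871)
alcohol_effect <- 0.360 * (14.20 - 8.00)

round(r2_gain, 3)
round(adj_r2, 3)
best_by_aic
round(c(density = density_effect, alcohol = alcohol_effect), 2)
```

The AIC is lowest for m5, which is consistent with chlorides adding nothing in m6. Over the observed ranges, the large density coefficient translates into a quality change of a similar order of magnitude to that of alcohol, because density varies by only about 0.05.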
##
## Calls:
## m1: lm(formula = I(quality) ~ density, data = ww)
## m2: lm(formula = I(quality) ~ density + alcohol, data = ww)
## m3: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar),
## data = ww)
## m4: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar) +
## log10(volatile.acidity), data = ww)
## m5: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar) +
## log10(volatile.acidity) + pH, data = ww)
## m6: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar) +
## log10(volatile.acidity) + pH + log10(chlorides), data = ww)
##
## ===============================================================================================================
## m1 m2 m3 m4 m5 m6
## ---------------------------------------------------------------------------------------------------------------
## (Intercept) 96.277*** -22.492*** 49.045*** 46.080*** 54.106*** 52.053***
## (4.003) (6.165) (9.711) (9.329) (9.421) (9.463)
## density -90.942*** 24.728*** -46.736*** -44.982*** -54.192*** -52.260***
## (4.027) (6.079) (9.652) (9.271) (9.402) (9.439)
## alcohol 0.360*** 0.284*** 0.311*** 0.295*** 0.286***
## (0.015) (0.017) (0.016) (0.016) (0.017)
## log10(residual.sugar) 0.465*** 0.554*** 0.612*** 0.599***
## (0.049) (0.047) (0.048) (0.049)
## log10(volatile.acidity) -1.519*** -1.502*** -1.487***
## (0.075) (0.075) (0.075)
## pH 0.399*** 0.393***
## (0.074) (0.074)
## log10(chlorides) -0.193*
## (0.088)
## ---------------------------------------------------------------------------------------------------------------
## R-squared 0.094 0.192 0.207 0.269 0.273 0.274
## adj. R-squared 0.094 0.192 0.207 0.268 0.272 0.273
## sigma 0.843 0.796 0.789 0.758 0.756 0.755
## F 509.911 583.290 425.851 449.179 367.191 307.041
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -6111.983 -5831.127 -5786.594 -5588.654 -5574.194 -5571.768
## Deviance 3478.689 3101.773 3045.878 2809.382 2792.843 2790.078
## AIC 12229.967 11670.255 11583.188 11189.307 11162.387 11159.536
## BIC 12249.456 11696.241 11615.671 11228.287 11207.864 11211.509
## N 4898 4898 4898 4898 4898 4898
## ===============================================================================================================
In this regression, I used some transformed variables. These transformations make effect interpretations more difficult, therefore I will focus on the effect direction and related changes in the adjusted R-squared that improve the overall model fit.
Model 1: Density variable is the same, effects same as before.
Model 2: Alcohol variable is the same, effects same as before.
Model 3: Residual sugar is log10 transformed, effect direction the same but overall adjusted R-squared a little less than before. Yet we stick to the transformed variable as the distribution requires transformation.
Model 4: Volatile acidity is log10 transformed, effect direction the same as before: the higher the volatile acidity, the lower the perceived wine quality. The adjusted R-squared slightly improved compared to the corresponding untransformed model (0.268 vs. 0.263). We stick to the transformed variable as the distribution requires transformation.
Model 5: pH variable is the same, effect slightly smaller than before.
Model 6: Chlorides is log10 transformed, effect is negative and significant. The adjusted R-squared now overall improved to 0.273. We stick to the transformed variable as the distribution requires transformation.
Adjusted R-squared: With the model modifications, we have improved our adjusted R-squared from 0.270 to 0.273, a little improvement compared to before.
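Although the log10 transformation makes the coefficients harder to read, they still have a simple interpretation: a coefficient b on a log10-transformed predictor means that multiplying the predictor by a factor m changes the predicted quality by b * log10(m), with the other predictors held fixed. The sketch below applies this to the m6 coefficients from the table; the helper name is an assumption for illustration.

```r
# m6 coefficients for the log10-transformed predictors, copied from the table above.
b_log10 <- c(residual.sugar = 0.599, volatile.acidity = -1.487, chlorides = -0.193)

# Change in predicted quality when a predictor is multiplied by `multiplier`,
# holding the other predictors fixed: b * log10(multiplier).
effect_of_multiplying <- function(b, multiplier) b * log10(multiplier)

doubling <- effect_of_multiplying(b_log10, 2)
tenfold  <- effect_of_multiplying(b_log10, 10)

round(rbind(doubling, tenfold), 3)
```

A tenfold increase returns the coefficient itself, while a doubling corresponds to roughly 30% of it.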
Outliers:
Outlier values were not cut off for these variables in order to reduce the complexity of the analysis. This data cleaning exercise should be conducted thoroughly when more time is available and is likely to improve the model performance.
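One possible starting point for such a cleaning step is the common 1.5 * IQR rule. This is an assumption about how outliers might be defined, not the method used in this analysis. The sketch below computes the fences from the quartiles printed in the summary output for the four transformed variables.

```r
# Quartiles copied from the summary output earlier in the report.
quartiles <- data.frame(
  variable = c("volatile.acidity", "residual.sugar", "chlorides", "sulphates"),
  q1  = c(0.21, 1.7, 0.036, 0.41),
  q3  = c(0.32, 9.9, 0.050, 0.55),
  max = c(1.10, 65.8, 0.346, 1.08)
)

# Tukey fences: values beyond q3 + 1.5 * IQR (or below q1 - 1.5 * IQR) are flagged.
iqr <- quartiles$q3 - quartiles$q1
quartiles$lower_fence <- quartiles$q1 - 1.5 * iqr
quartiles$upper_fence <- quartiles$q3 + 1.5 * iqr
quartiles$max_is_outlier <- quartiles$max > quartiles$upper_fence

quartiles
```

For all four variables the maximum lies above the upper fence, so each of them would lose at least some observations under this rule.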
The perceived quality of white wine looks roughly normally distributed.
The majority of white wines have a quality rating between 5 and 7; there are very few wines at the outer boundaries of the histogram.
We’re focussing on quality, so here are some observations:
- lower volatile acidity seems to have a slightly positive influence
- same applies for chlorides and total sulfur dioxide
- lower density seems to have a rather positive influence on quality
- higher alcohol seems to have a strongly positive impact on quality
Amongst the others, following observations are interesting:
We need to keep in mind the rule of thumb that a small degree of relatedness starts at r >= 0.3, which we can show here for some cases. Yet this also depends on the number of cases, and we have plenty of cases in our dataset.
Here, our suspicion of a relation between some of our independent variables is confirmed. We can observe the following based on scatter plots with a linear regression smoother (purple):
The smoother lines confirm the correlation coefficients. This example also confirms that if we want to run a stellar regression analysis of what impacts white wine quality, we need to be aware that the independent variables are related with one another. Without me being the wine expert, it might be the case that the level of residual sugar in a wine influences both density and alcohol and therefore indirectly the quality, even though sugar does not show highly significant relations to wine quality in our correlation matrix.
##
## Calls:
## m1: lm(formula = I(quality) ~ density, data = ww)
## m2: lm(formula = I(quality) ~ density + alcohol, data = ww)
## m3: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar),
## data = ww)
## m4: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar) +
## log10(volatile.acidity), data = ww)
## m5: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar) +
## log10(volatile.acidity) + pH, data = ww)
## m6: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar) +
## log10(volatile.acidity) + pH + log10(chlorides), data = ww)
##
## ===============================================================================================================
## m1 m2 m3 m4 m5 m6
## ---------------------------------------------------------------------------------------------------------------
## (Intercept) 96.277*** -22.492*** 49.045*** 46.080*** 54.106*** 52.053***
## (4.003) (6.165) (9.711) (9.329) (9.421) (9.463)
## density -90.942*** 24.728*** -46.736*** -44.982*** -54.192*** -52.260***
## (4.027) (6.079) (9.652) (9.271) (9.402) (9.439)
## alcohol 0.360*** 0.284*** 0.311*** 0.295*** 0.286***
## (0.015) (0.017) (0.016) (0.016) (0.017)
## log10(residual.sugar) 0.465*** 0.554*** 0.612*** 0.599***
## (0.049) (0.047) (0.048) (0.049)
## log10(volatile.acidity) -1.519*** -1.502*** -1.487***
## (0.075) (0.075) (0.075)
## pH 0.399*** 0.393***
## (0.074) (0.074)
## log10(chlorides) -0.193*
## (0.088)
## ---------------------------------------------------------------------------------------------------------------
## R-squared 0.094 0.192 0.207 0.269 0.273 0.274
## adj. R-squared 0.094 0.192 0.207 0.268 0.272 0.273
## sigma 0.843 0.796 0.789 0.758 0.756 0.755
## F 509.911 583.290 425.851 449.179 367.191 307.041
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -6111.983 -5831.127 -5786.594 -5588.654 -5574.194 -5571.768
## Deviance 3478.689 3101.773 3045.878 2809.382 2792.843 2790.078
## AIC 12229.967 11670.255 11583.188 11189.307 11162.387 11159.536
## BIC 12249.456 11696.241 11615.671 11228.287 11207.864 11211.509
## N 4898 4898 4898 4898 4898 4898
## ===============================================================================================================
In this regression, I used some transformed variables. These transformations make effect interpretations more difficult, therefore I will focus on the effect direction and related changes in the adjusted R-squared that improve the overall model fit.
Model 1: Density has a highly significant negative influence on the quality rating. It can explain about 9.4% of the variance of quality level ratings. The denser the wine, the worse the quality is rated.
Model 2: Alcohol has a highly significant positive influence on the quality rating. Adding it explains about another 9.8% of the variance in quality level ratings (R-squared rises from 0.094 to 0.192). Its coefficient is smaller than the one of density; this can be explained by the different measurement scales and the very small range of density values, which lead to a very large change in predicted quality if density is increased by 1. Interestingly, density shows a positive relation to quality in model 2, opposite to all other models.
The more alcohol, the better the wine quality is rated.
Model 3: Residual sugar also has a highly significant positive influence on the quality ratings. The more residual sugar, the better the wine quality is rated. The comprehensive regression model rejects the initial hypothesis, based on bivariate correlations, that the effect of residual sugar on quality is negative; here the relation is positive.
Model 4: Volatile acidity has a highly significant negative influence on the quality ratings. The higher the volatile acidity, the worse the wine quality is rated. The comprehensive regression model confirms the initial hypothesis, based on bivariate correlations, that the effect of volatile acidity on quality is negative.
Model 5: pH value has a highly significant positive influence on the quality ratings. The higher the pH value, the better the quality is perceived.
Model 6: Chlorides is log10 transformed, effect is negative and significant.
Adjusted R-squared: With the model modifications, we have improved our adjusted R-squared from 0.270 to 0.273, a little improvement compared to before.
The univariate analysis uncovered insights about the distribution of the data that could be used for later work in the regression model. In theory, these might also have been applicable to the bivariate assessments, but would have made the interpretation more difficult. As I wanted to focus the interpretation on the regression models, I applied the transformations in the last stage of the project.
Bivariate and multivariate analyses led to insights that make it easier to understand which factors contribute to a high perceived white wine quality.
Outlier values were not cut off for the transformed variables to reduce complexity of the analysis. This data cleaning exercise should be thoroughly conducted with more time and is likely to improve the model performance.
Residual sugar in the log10 version has improved the model quality, but when looking at the distribution of the variable, some other transformation might lead to better effects.
As a next step, the approach and the results could be used for the red wine data and shared with my local wine dealer to discuss the results. For wine producers, these insights can serve as key to bring their portfolio wines closer to the taste of the customers.
A nice add-on to the dataset would be the price. A common hypothesis for wine goes: the more expensive, the higher the perceived quality. This has been disproven in many lab-scale tests I’ve seen on TV, but it would be interesting to see how this relates in our sample of many thousand white and red wines.